.sp 15
.ce 3
The Postman's Sort
.sp 1
Robert Ramey
.sp 3
.ce 1
Abstract
.sp 1
This article describes a program that sorts an arbitrarily large
number of records in less time than any algorithm based on comparison
sorting can. For many commonly encountered files, sorting time will be
strictly proportional to the number of records.
It is not a toy program. It can sort on an arbitrary
group of fields, with an arbitrary collating sequence on each field,
faster than any other program available.
.bp
.SH Introduction
.PP
Forget for a moment everything you know about computer science.
.PP
If you were given a stack of playing cards and asked to sort
them by suit, how would you go about it?
.PP
Most people would deal the cards into four piles, one for each
suit. After dealing the last card the piles would be picked up
one after the other.
Sorting a deck of cards this way into four suits would take
about a minute.
.PP
Suppose that you were given 100 decks of cards to sort into suits.
Would you use
a different method? How long would it take?
It should be pretty clear that most people would use the same
method, and that it would take 100 times as long as sorting
one deck of cards into four suits.
.PP
So here we have a sorting method whose time is strictly
proportional to the number of items to be sorted. This method is
known as distribution sorting.
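.PP
To make the dealing idea concrete, here is one way it might look in
C. This is a counting variant of dealing into piles; since only the
suit matters here, cards of one suit are interchangeable:
.nf
enum suit { CLUBS, DIAMONDS, HEARTS, SPADES, NSUITS };

/* One pass to deal (tally) and one pass to pick up the piles:
   the work is strictly proportional to the number of cards n. */
void sort_by_suit(const enum suit deck[], int n, enum suit out[])
{
    int count[NSUITS] = {0};
    int i, s, k = 0;

    for (i = 0; i < n; i++)        /* deal each card to its pile */
        count[deck[i]]++;
    for (s = 0; s < NSUITS; s++)   /* pick up the piles in order */
        while (count[s]-- > 0)
            out[k++] = (enum suit)s;
}
.fi
No card is ever compared with another card; each is simply dealt to
the pile where it belongs.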
.PP
Most computer professionals "know" that the time to sort a file
of records must increase disproportionately to the number of records
in the file. It's a good thing that the rest of us don't know it.
We might never get our mail.
.PP
In fact, most will cite Donald E. Knuth's book
.UL
The Art of Computer Programming: Sorting and Searching.
Knuth's book contains a chapter on minimum comparison sorting
which shows that a sorting method which compares records
can do no fewer than n*lg2(n) (that's log base 2) comparisons.
However, this analysis
specifically excludes methods that do not use key comparison.
In fact, other sections of the book allude to methods whose time
is proportional to the number of records sorted.
.PP
Then why does it seem that every sort utility uses quicksort?
Practical sort utilities have to be able to handle a wide
variety of keys and key types, and be fast. The methods described
in Knuth's book that do not use comparison depend on
fixed-length keys. Also, although they might be
faster for a sufficiently large number of records, they are slow
on the numbers of records usually encountered.
For example, to sort a group of records on
a social security number using radix list sort would require
30 passes through the file!
.PP
There is a way to overcome these deficiencies. In fact, the
post office uses it every day. While researching this article
(after writing the program), I found that Knuth alluded to it
and left it as an exercise for the reader.
.SH "A Generalized Distribution Sorting Algorithm"
.PP
When a postal clerk receives a huge bag of letters he distributes
them into other bags by state. Each bag gets sent to the
indicated state. Upon arrival, another clerk distributes
the letters in his bag into other bags by city. The
process continues until the bags are a size one man
can carry and deliver. This is the basis for my sorting method,
which I call the postman's sort.
.PP
Suppose we are given a large list of records to be ordered
alphabetically on a particular field.
Make one pass through the file. Each record
read is added to one of 26 lists depending on the first
letter in the field. The first list contains all the
records with fields starting with the letter "A", while the last
contains all
the records with fields starting with the letter "Z".
Now we have divided the
problem into 26 smaller subproblems.
Next we address the subproblem of sorting all the records in the
sublist corresponding to key fields starting with the letter "A".
If there are no records in the "A" sublist, we can proceed to
deal with the "B" sublist.
If the "A" sublist contains only one record, it can be written
to the output and we are done with that sublist.
If the "A" sublist contains more than one record, it must be sorted,
then output. Only when the "A"
sublist has been disposed of can we move on to each of the other
sublists in sequence. The records will be written to the output
in alphabetical order. When the "A" sublist does have to be sorted,
what sorting
algorithm should be used? Just like a real postman, we use the
postman's sort, this time applying the method to the second
letter of the field.
This is done to greater and greater depths until eventually
all the words starting with "A" are written to the output.
We then deal with sublists "B" through "Z"
in the same manner.
.PP
The above algorithm is implemented in the accompanying program.
The function sort() is passed a sublist. First it determines
how many records the sublist has. If it's one, we're done after
writing the record. If the sublist is empty, we can just return.
Otherwise we advance the pointer into the sort key and continue.
plan() and allocate() reserve memory on the stack
for sublist pointers. dist() distributes the input sublist
among the newly created sublists. dsort() calls sort()
for each of the newly created sublists.
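.PP
To make the recursion concrete, here is a minimal sketch in C. It is
not the accompanying program: it distributes one byte at a time,
assumes NUL-terminated keys and enough memory, and ignores collating
sequences and storage management; all names here are mine.
.nf
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct rec {
    struct rec *next;   /* sublists are singly linked lists */
    char *line;         /* the whole record */
    char *key;          /* current position within the sort key */
};

static void psort_sketch(struct rec *list)
{
    struct rec *head[256], *tail[256], *r, *nxt;
    int i;

    if (list == NULL)             /* empty sublist: nothing to do */
        return;
    if (list->next == NULL) {     /* single record: just write it */
        puts(list->line);
        return;
    }
    memset(head, 0, sizeof head);
    /* Distribution pass: one sublist per possible byte value.
       Sublist 0 collects records whose key is exhausted. */
    for (r = list; r != NULL; r = nxt) {
        nxt = r->next;
        i = (unsigned char)*r->key;
        if (i != 0)
            r->key++;             /* advance one byte into the key */
        r->next = NULL;
        if (head[i] == NULL)
            head[i] = r;
        else
            tail[i]->next = r;
        tail[i] = r;
    }
    for (r = head[0]; r != NULL; r = r->next)
        puts(r->line);            /* exhausted keys sort first */
    for (i = 1; i < 256; i++)     /* recurse in collating order */
        psort_sketch(head[i]);
}

int main(int argc, char **argv)
{
    struct rec *list = NULL;
    int i;
    for (i = 1; i < argc; i++) {  /* build a sublist from the args */
        struct rec *r = malloc(sizeof *r);
        r->line = r->key = argv[i];
        r->next = list;
        list = r;
    }
    psort_sketch(list);           /* prints the args in byte order */
    return 0;
}
.fi
Note that each record is handled once per key byte consumed, which is
where the time proportional to total key length comes from.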
.SH "Reality Check"
.PP
A practical sort utility must be able to sort on any fields in any
sequence with any collating sequence. The command line syntax permits
specification of delimited fields, of the start of the key within
each field, and of the collating sequence for each field.
See the accompanying manual page for the sort program.
Sorting proceeds according to each key field.
The first character within a field which corresponds to a
zero collating value terminates the field. The rest of
the field is not considered in the sort. Sorting continues
on the remaining key fields.
This means that "A<0>C" might precede "A<0>B"
in the final output.
If this behavior is not desired,
one should make sure that every character that could appear
within a field is assigned a non-zero collating sequence value.
.PP
To be really useful the program must take into account that
records can have varying numbers of fields, fields can
have varying numbers of characters, and that the command line
may specify fields and/or key characters that a particular
record does not contain.
In general, keys corresponding to non-existent fields are
distributed with a collating value of zero. This
turns out to be the most natural, convenient and flexible
of the possibilities considered.
It introduces some complications which require refinements
of the algorithm.
.PP
After distribution is completed by dist(), the newly created
sublists are processed by dsort(). Sublists corresponding
to a zero collating value are handled first.
For these sublists, dsort() advances key pointers to the
next key field before continuing to distribute them.
The remaining sublists are handled as previously described.
.SH "Speeding things up"
.PP
Is this the best we can do? Suppose we wanted to sort 100 decks
of shuffled playing cards by value and suit. We could first
distribute by suit and then distribute each pile by value.
But we probably wouldn't. More likely we would
arrange 4 rows and 13 columns. Each card would be dealt
to the pile corresponding to its suit and value. All the cards
would be sorted in one pass.
.PP
When we're sorting on the basis of a key field we can do the
same thing by distributing among a larger number of sublists.
In one pass we would have sublists
"\0?", "A\0", "AA", "AB",...,"AZ", "B\0", "BA", ...,"ZY","ZZ".
Of course information on 1 + 26 * (1 + 26) = 703 sublists
must be maintained in memory, but the effect is to cut in half the
number of times each key has to be addressed.
The basic principle is to
sort on the basis of as many bytes at a time as possible.
If we want to include all 8 bit bytes in the collating sequence
we are going to need 64K sublists to sort on two bytes at a time.
This is usually not going to be practical (at least on Intel type
processors, where the maximum size of a data segment is 64K).
Fortunately, in the
real world we usually know something about the fields we want to sort on.
For example, to create a telephone directory, we sort on people's
names, last name first. These fields will only contain alphabetic
characters, hence for this purpose we can take two bytes at a time.
Social security numbers only contain decimal digits, so we can
take 3 or 4 at a time before things get out of hand.
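.PP
As an illustration (with assumed names), two key bytes can be folded
into a single sublist index like this, using a table colseq[] that
maps each byte to its collating value, 0 for any byte outside the
specified sequence:
.nf
/* Radix 27: the terminator plus 'a'-'z'.  Index 0 means the key
   is exhausted; otherwise indices run from 27 to 728, of which
   only 1 + 26*27 = 703 can actually occur.  Assumes keys are
   NUL-terminated and colseq['\0'] == 0. */
unsigned index2(const unsigned char *key,
                const unsigned char colseq[256])
{
    unsigned v1 = colseq[key[0]];
    if (v1 == 0)
        return 0;                     /* key exhausted: sublist 0 */
    return v1 * 27 + colseq[key[1]];  /* two bytes, one pass */
}
.fi
One call to index2() does the work of two single-byte distribution
passes.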
.PP
The function plan(), called from sort(), determines the best number
of bytes to take, given the collating sequences defined for the
key fields. It also sets up pointers to be able to find the key fields.
The program is designed so that we can take bytes from
more than one field in one pass. This assures us that the
maximum possible number of bytes is used to distribute
each record each time it is addressed.
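.PP
The decision plan() must make can be sketched as follows (the names
and the sublist ceiling are my own; the radix counts the 0 terminator
plus the characters given a non-zero collating value):
.nf
/* How many key bytes can one pass distribute on, given a ceiling
   on the number of sublists we can afford to keep in memory?
   count(d) = 1 + (radix-1)*count(d-1) sublists cover all key
   prefixes of length <= d; radix must be at least 2. */
int bytes_per_pass(unsigned radix, unsigned long max_sublists)
{
    int depth = 0;
    unsigned long lists = 1;      /* count(0): the empty key */
    while (1 + (radix - 1) * lists <= max_sublists) {
        lists = 1 + (radix - 1) * lists;
        depth++;
    }
    return depth > 0 ? depth : 1; /* always take at least one */
}
.fi
For example, bytes_per_pass(27, 703) yields 2, matching the 703
alphabetic sublists above, and bytes_per_pass(11, 1111) yields 3
(the ceiling of 1111 is assumed), consistent with the three decimal
digits mentioned below.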
.PP
To sort a file of records on a social security number in the
third field in format 999-99-9999, one could use the command line
.nf
asort -k '0'-'9' -f 2 -c 0 -c 4 -c 7
.fi
The default setup for asort will allocate sublist space to handle
three decimal digits. The first time a record is addressed
it will be distributed on the first three digits.
The second time, '-' will be encountered at the next key
position, so all the records with the same first three digits
will be added to sublist 0.
The third time, distribution will be on the middle two digits. Enough
space will be allocated to distribute on three digits, but only
a small portion will be used, as each field will be terminated by the '-'.
The fourth distribution will be on the next three digits, and the
fifth will be on the last digit and field terminator.
The best command line to sort a file of records based on
a social security number in the third field in format 999-99-9999
would be
.nf
asort -k '0'-'9' -f 2 -c 0-2 -c 4-5 -c 7-10
.fi
In this case the sort would pass over the records only three times.
In general, the more information we supply on the command
line, the faster the sort is likely to be executed.
.SH "Storage Management"
.PP
Sublists are implemented as linked lists of records, so moving
records around is fast and simple. As records are read from
the standard input they are processed by the first
distribution pass. If all the records don't fit in memory,
a temporary disk file containing the sublists is created.
Compiling with the Microsoft C v5.21 compact memory model (that's
small code/large data),
it turns out that for an average file with random alphabetic keys, a given
record will be written to and read from the temporary file
at most once if the
file to be sorted is less than 280MB. Of course, if we're using MSDOS
we run into other problems first.
.PP
As usual, a disproportionate part of the program is taken up
with issues of memory management. Even so, I won't go into them
here, as they are peripheral to the subject of this article.
.SH "Comparing apples and oranges"
.PP
Below I offer an informal analysis of the speed of the postman's sort
for some different types of files.
Comparative analysis of the comparison sorting and the postman's sort
is somewhat difficult since the methods are so different. I have chosen
to consider "primitive operations". By primitive operation I mean
each time a record is addressed and/or manipulated in some way.
If the primitive operation has some sort of inner loop, as
a string comparison does, I count each cycle of the inner loop as
a primitive operation.
.PP
For comparison sorts this will be comparing two bytes
and perhaps moving records. For the postman's sort
this will be manipulation of a byte from the key field.
Of course, this is going to be a rough
approximation but I think it will still be worthwhile.
.PP
First consider the case where a relatively short fixed-length key is
used for the sort, e.g. sorting accounting transactions by date in
format mmdd.
The number of days in a month or year is small enough that
we would expect to complete the job in one pass through the file,
distributing records among 2*10*4*10 = 800 sublists. This is
specified with a command line of
.nf
-k '0'-'1' -c 0 -k '0'-'9' -c 1 -k '0'-'3' -c 2 -k '0'-'9' -c 3
.fi
In this case the number of primitive operations required by the postman's
sort will be equal to the number of records to be sorted times four operations
per record, as there are four bytes in the key. Note that this is an
example of a sorting method which requires time proportional to the
number of records in the file. And it's not just a toy problem.
.PP
Using any comparison sorting method,
the number of comparisons will be at least N*lg2(N). Since the number of
records far exceeds the number of distinct keys, the number of primitive
operations per comparison will be close to 4. Using 3 as an estimate, the
total number of primitive operations will be about 3*N*lg2(N).
The postman's sort will be faster than the fastest comparison sort
by the following factor:
.nf
3*N*lg2(N)
----------  =  .75*lg2(N)
   4*N
.fi
For 100,000 records, lg2(N) is about 16.6, so the fastest comparison
sort will require about 12 times more
primitive operations than the postman's sort.
.PP
Next consider the case of a relatively long variable length key.
Suppose we use as an example keys made of random alphabetic characters.
Using the postman's sort and taking two characters at a time,
the first pass through the file will generate 703 subfiles. Each subfile
will in turn be distributed among 703 subfiles, etc.
until each subfile has only
one record. Hence each record will be addressed approximately
lg703(N) times, where N is the number of records in the file. Since
each pass distributes on two key bytes, the total number of
primitive operations will be 2*N*lg703(N).
.PP
Using a comparison sort,
most of the comparisons are between keys that are close together.
If there are fewer than 26 records in the file, we would expect most
comparisons to terminate on the first byte. If there are fewer than
26*26 records in the file, most comparisons will terminate on the second
byte, and so on. This works out to lg26(N) primitive operations
for each comparison. The number of comparisons is at least N*lg2(N),
so the total number of primitive operations would be N*lg2(N)*lg26(N).
.PP
For this file the postman's sort will be faster than the comparison
sort by the following factor.
.nf
N*lg2(N)*lg26(N)     lg2(N)*ln(703)
----------------  =  --------------  =  .9*lg2(N)
  2*N*lg703(N)           2*ln(26)
.fi
or almost 16 times faster for a 100,000 record file.
.SH "Just How Fast Is It"
.PP
The above analysis is rather crude, in that it makes a number of
assumptions to simplify the calculations.
It assumes that all types of "primitive operations"
take the same amount of time to complete. Also, a practical program
spends a lot of time in storage management overhead, which the
theoretical approach doesn't take into account. So let's look at
some real results.
In order to test this sorting program, I generated test files consisting
of variable length records of random alphabetic characters. The average
record length is 15 bytes. The files were 1000, 10,000 and 100,000
records long (15K, 150K, and 1.5MB). For comparison purposes I
used the sort program from the MKS Toolkit from Mortice Kern Systems.
I have assumed that it uses some sort of comparison sort.
The results are summarized in the following table:
.nf
                    Comp. Sort        Postman's Sort  Speed Factor
Time, fixed key     3*N*lg2(N)        4*N             .75*lg2(N)
Time, variable key  N*lg2(N)*lg26(N)  2*N*lg703(N)    .9*lg2(N)
Sort 1K records     3 sec             3 sec           1.0
Sort 10K records    27 sec            18 sec          1.5
Sort 100K records   558 sec           290 sec         1.92
.fi
My machine is a 12 MHz AT compatible with two 28ms Seagate ST-251-1
disk drives. One drive was used for temporary files. The number of
drives used greatly affected the time for a comparison sort of a large
file. It had little effect on the time required for the postman's
sort.
.PP
Well, it looks like the postman's sort isn't as fast as my theory
predicts. But it's still twice as fast as the next best thing.
.SH "Final Observations"
One of the most gratifying things about the postman's sort is that
the first records are produced on the standard output before the
input file has even been completely read! In our test file, records which
contain just a newline character are output as they are read in.
When the last record is read in, the first non-null records appear
almost immediately. If one is working with a multi-tasking system
with pipes, much time will be saved, as the subsequent process can
proceed in parallel with the sorting process. So even though this
sort is only twice as fast, the jobs that use it may be improved
much more.
.PP
Well, there you have it. A program which sorts a file of N records
in less time
than it takes to make N*lg2(N) comparisons. For a large and common
class of sorting keys, it will sort a file in time proportional to
the number of records in the file.
.bp
NAME
psort - sort the standard input file
SYNOPSIS
psort [-t <dir>] [-s <record size>]
.br
[ [-k <keys>] ] [ [-f <range>] [-c <range>]... ]...
DESCRIPTION
psort sorts lines of the standard input file and writes the result
on the standard output. Default key is the entire line.
Default ordering is lexicographic by bytes in machine
collating sequence.
.in +4
.ti -4
-t use the following name for the temporary directory.
The default is taken from the environment variable %TMP%.
.ti -4
-s record size.
If a range (e.g. 10-124) is specified, records will be of variable
length, delimited by a newline. If a record larger than the indicated
maximum size is encountered, the program will terminate with an error
message.
If a single value is specified,
records will be fixed in length and the newline will have no special
significance.
If no record size is specified, variable length records up to 511
bytes are assumed.
.ti -4
-k specify the collating sequence for the subsequently specified
sorting fields. The collating sequence is specified as one
or more ranges of values. Characters are assigned collating
sequence in order of their specification. For example,
to sort lower case alphabetic characters use -k 'a'-'z'.
Any characters not specified will be assigned a collating
value of 0. Characters in a field beyond a character with
a 0 collating value will not be included within the sorted field.
Hence, either of records b<0>a and b<0>c may precede the other
in the output file.
Within a key specification, any number of ranges may be
specified. For example, if the sorting field will contain
any combination of lower case letters, digits and spaces,
use -k ' ' '0'-'9' 'a'-'z'. Spaces will sort before digits
which will sort before lower case letters. If no key
collating sequence is specified, a default collating
sequence of all printable ASCII characters is used.
.ti -4
-r repeat previous collating sequence. For example, to fold
upper case letters to lower case letters for purposes of
determining sorting priority use -k 'a'-'z' -r 'A'-'Z'.
This would assign the first character following the -r
the same collating value as the first one assigned in the
previous range. To give varying white space characters equal
weight use -k ' ' -r '\t' -r '_' .
.ti -4
-n numeric sort on the key. This is an alternative to -k.
Numeric fields may contain a leading sign and/or decimal point.
Numeric fields should look like
[' ']...[+|-]['0'-'9']...[.]['0'-'9']...
Any non-numeric characters will terminate the field.
The field will be sorted by numeric value.
.ti -4
-i invert the sequence of the sort for the last key specified.
.ti -4
-u output only records that are unique according to the
sorting key fields.
.ti -4
-f sort on one or more fields. Fields are groups of
characters separated by a delimiter character.
Fields are numbered starting at 0. That is, -f 0
refers to the start of the record.
A field specification may contain a range of
fields, as in -f 2-4, to indicate that the sorting
sequence is to be determined on the basis of
the third, fourth and fifth fields in turn.
A range must have a definite end; i.e. -f 2-
is not permitted. A field range need not
be increasing; i.e. -f 3-2 is permitted and
will sort first by the fourth and then by the
third field.
.ti -4
-c sort on one or more characters within the indicated
fields. Start counting character positions from 0.
For example, -f 1 -c 2-3 would sort on the third and
fourth characters of the second field. Several
character ranges may be specified for a given field.
For example, -f 2 -c 5-6 -c 3-4 -c 1-2 would specify
three sorting fields of 2 characters each within the
third delimited field. When specifying a character
range within a field, the second number must be greater
than or equal to the first; i.e. -c 7-3 cannot be used.
An indefinite character range can be specified, as
in -c 4- . This indicates all characters from
the fifth to the end of the field.
.ti -4
<range> a range is used to specify ranges of fields, displacements
within a field, and collating values. The common syntax is
<start>[-[<end>]] . <start> indicates a single value. <start>-
indicates a range beginning at <start> and extending to a large
number. For
example, -f 2- would be used to specify all fields after the
second. <start>-<end> indicates a bounded range. The
start and end numbers can be in a number of formats:
simple decimal numbers; numbers starting with 0 are taken
to be octal; numbers starting with 0x are taken to be
hexadecimal; and characters within apostrophes are converted
to their ASCII values. Hence -k ' ' 'a'-'z' and -k 0x20 'a'-122
are equivalent.
.ti -4
-d the following character is the field delimiter. For example
-d '|' . The default field delimiter is a tab (0x09).
.in -4
If no sorting fields are specified, the whole record is taken as
a sorting field.
Sorting proceeds according to the precedence indicated by
the sequence of the sorting fields.
Records with the same sorting fields
will be output in an unpredictable sequence.
Remember that characters not specified within a collating
sequence are taken as collating value zero.
This can result in unexpected behavior when fields
are not the same length. Following is the result of
sorting a small file with -k 'z'-'a'.
.nf
def
cad
basdf
a
aa
.fi
This was probably not the result intended. To get the
desired result, use -k 'a'-'z' -i.
.nf
def
cad
basdf
aa
a
.fi
The following switches normally need not be used. They are included
for purposes of fine tuning and debugging.
.in +4
.ti -4
-m maximum memory in K to be allocated for sort. This can be useful
in a multitasking environment so that psort doesn't suck up all the
memory available in the system.
.ti -4
-b specify the size of the buffers used for standard input/output in K.
The default size is 30.
.ti -4
-sb specify the size of the output buffer used to write data to
the temporary file. Current value is 30.
.ti -4
-rb specify the size of the input buffer used to read data from the
temporary file. Current value is 30.
.ti -4
-v specify visible mode. This displays statistics on each distribution
pass in the file. It is useful for debugging and fine tuning. If only
the top levels of distribution are desired use -v <number of levels>.
.ti -4
-l length of segment used by internal storage in K. Current value is 16.